From Volcano to Toyshop: Adaptive Discriminative Region Discovery for Scene Recognition
As deep learning approaches to scene recognition have emerged, they have continued
to leverage discriminative regions at multiple scales, building on practices
established by conventional image classification research. However, these approaches
remain largely generic and do not carefully consider the special properties of
scenes. In this paper, inspired by the intuitive differences between scenes and
objects, we propose Adi-Red, an adaptive approach to discriminative region
discovery for scene recognition. Adi-Red uses a CNN classifier, which was
pre-trained using only image-level scene labels, to discover discriminative
image regions directly. These regions are then used as a source of features to
perform scene recognition. The use of the CNN classifier makes it possible to
adapt the number of discriminative regions per image using a simple, yet
elegant, threshold, at relatively low computational cost. Experimental results
on the scene recognition benchmark dataset SUN397 demonstrate the ability of
Adi-Red to outperform the state of the art. Additional experimental analysis on
the Places dataset reveals the advantages of Adi-Red and highlights how they
are specific to scenes. We attribute the effectiveness of Adi-Red to the
ability of adaptive region discovery to avoid introducing noise, while also not
missing out on important information.
Comment: To appear at the ACM International Conference on Multimedia (ACM MM 2018). Code available at https://github.com/ZhengyuZhao/Adi-Red-Scen
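The adaptive thresholding idea above can be sketched with plain numpy: given an activation map from a scene classifier, keep the local maxima that exceed a threshold, so the number of discovered regions adapts per image. This is an illustrative simplification, not the authors' implementation; the function name and map shape are ours.

```python
import numpy as np

def adaptive_regions(act_map, threshold):
    """Return (row, col) peaks of an activation map above `threshold`.

    The number of regions adapts to the image: a map with one dominant
    peak yields one region, a cluttered map yields several.
    """
    h, w = act_map.shape
    padded = np.pad(act_map, 1, mode="constant", constant_values=-np.inf)
    peaks = []
    for i in range(h):
        for j in range(w):
            v = padded[i + 1, j + 1]
            nbhd = padded[i:i + 3, j:j + 3]  # 3x3 window centred on (i, j)
            if v >= threshold and v == nbhd.max():
                peaks.append((i, j))
    return peaks
```

Lowering the threshold admits more (weaker) regions; raising it keeps only the dominant one, which is the adaptivity the abstract describes.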
FakeCLR: Exploring Contrastive Learning for Solving Latent Discontinuity in Data-Efficient GANs
Data-Efficient GANs (DE-GANs), which aim to learn generative models with a
limited amount of training data, encounter several challenges for generating
high-quality samples. Since data augmentation strategies have largely
alleviated training instability, how to further improve the generative
performance of DE-GANs has become a research hotspot. Recently, contrastive
learning has shown great potential for increasing the synthesis quality of
DE-GANs, yet the underlying principles are not well explored. In this paper, we revisit and compare
different contrastive learning strategies in DE-GANs, and identify that (i) the
current bottleneck of generative performance is the discontinuity of the latent
space, and (ii) compared to other contrastive learning strategies,
Instance-perturbation promotes latent-space continuity, which brings the
major improvement to DE-GANs. Based on these observations, we propose FakeCLR,
which only applies contrastive learning on perturbed fake samples, and devises
three related training techniques: Noise-related Latent Augmentation,
Diversity-aware Queue, and Forgetting Factor of Queue. Our experimental results
manifest the new state of the arts on both few-shot generation and limited-data
generation. On multiple datasets, FakeCLR acquires more than 15% FID
improvement compared to existing DE-GANs. Code is available at
https://github.com/iceli1007/FakeCLR.Comment: Accepted by ECCV202
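The core idea of applying contrastive learning only to perturbed fake samples can be illustrated with a minimal InfoNCE loss in numpy, where the two views of each fake sample come from instance perturbation. This is our sketch of the standard InfoNCE objective, not the released FakeCLR code, and it omits the queue-based techniques.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE loss between two batches of embeddings.

    z1[i] and z2[i] are embeddings of two noise-perturbed copies of the
    same fake sample (instance perturbation); all other rows act as
    negatives. Lower loss means matched pairs are pulled together.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau                      # pairwise cosine / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs)).mean()         # positives on the diagonal
```

Matched pairs should produce a much lower loss than mismatched ones, which is the signal a DE-GAN discriminator head can be trained on.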
MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models
The advent of open-source AI communities has produced a cornucopia of
powerful text-guided diffusion models trained on various datasets.
However, few explorations have been conducted into ensembling such models to
combine their strengths. In this work, we propose a simple yet effective method called
Saliency-aware Noise Blending (SNB) that can empower the fused text-guided
diffusion models to achieve more controllable generation. Specifically, we
experimentally find that the responses of classifier-free guidance are highly
related to the saliency of the generated images. Thus, we propose to trust different
models in their areas of expertise by blending the predicted noises of two
diffusion models in a saliency-aware manner. SNB is training-free and can be
completed within a DDIM sampling process. Additionally, it can automatically
align the semantics of two noise spaces without requiring additional
annotations such as masks. Extensive experiments show the impressive
effectiveness of SNB in various applications. Project page is available at
https://magicfusion.github.io/
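A minimal sketch of saliency-aware blending: per pixel, keep the noise prediction of the model whose classifier-free-guidance response is stronger. The function, the hard 0/1 mask, and the argument names are our simplifications of SNB, not the paper's exact formulation.

```python
import numpy as np

def saliency_noise_blend(eps_a, eps_b, guid_a, guid_b):
    """Blend the predicted noises of two diffusion models.

    `guid_a` / `guid_b` stand in for each model's classifier-free-guidance
    response (the cond - uncond prediction gap); per pixel we trust the
    model whose response magnitude is larger. All arrays share one shape.
    """
    sal_a = np.abs(guid_a)
    sal_b = np.abs(guid_b)
    mask = (sal_a >= sal_b).astype(eps_a.dtype)  # 1 where model A is more salient
    return mask * eps_a + (1.0 - mask) * eps_b
```

In a real sampler this blend would replace the single-model noise prediction inside each DDIM step, which is why the method needs no training.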
Unified Discrete Diffusion for Simultaneous Vision-Language Generation
The recently developed discrete diffusion models perform extraordinarily well
in the text-to-image task, showing significant promise for handling
multi-modality signals. In this work, we harness these traits and present a
unified multimodal generation model that can conduct both the "modality
translation" and "multi-modality generation" tasks using a single model,
performing text-based, image-based, and even vision-language simultaneous
generation. Specifically, we unify the discrete diffusion process for
multimodal signals by proposing a unified transition matrix. Moreover, we
design a mutual attention module with fused embedding layer and a unified
objective function to emphasise the inter-modal linkages, which are vital for
multi-modality generation. Extensive experiments indicate that our proposed
method performs comparably to state-of-the-art solutions in various
generation tasks.
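A unified transition matrix can be illustrated for a single step of discrete diffusion over one shared vocabulary with an absorbing [MASK] token. The specific stay/uniform/mask mixture below is a common discrete-diffusion construction and is our assumption for illustration, not necessarily the paper's exact matrix.

```python
import numpy as np

def unified_transition_matrix(k, alpha, gamma):
    """Single-step transition matrix Q over a vocabulary of k tokens
    plus one absorbing [MASK] token (index k).

    With probability `alpha` a token keeps its value, with probability
    `gamma` it jumps to [MASK], and the remainder is spread uniformly
    over the k tokens. Using one such matrix over the concatenated
    text+image vocabulary is the spirit of "unifying" the process.
    """
    beta = (1.0 - alpha - gamma) / k   # uniform leak to each token
    q = np.full((k + 1, k + 1), beta)
    np.fill_diagonal(q, alpha + beta)
    q[:, k] = gamma                    # jump to [MASK]
    q[k, :] = 0.0
    q[k, k] = 1.0                      # [MASK] is absorbing
    return q
```

Each row is a valid categorical distribution (sums to 1), which is the property any such transition matrix must satisfy.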
PartSeg: Few-shot Part Segmentation via Part-aware Prompt Learning
In this work, we address the task of few-shot part segmentation, which aims
to segment the different parts of an unseen object using very few labeled
examples. We find that leveraging the textual space of a powerful
pre-trained image-language model (such as CLIP) can be beneficial for learning
visual features. Therefore, we develop a novel method termed PartSeg for
few-shot part segmentation based on multimodal learning. Specifically, we
design a part-aware prompt learning method to generate part-specific prompts
that enable the CLIP model to better understand the concept of "part" and
fully utilize its textual space. Furthermore, since the concept of the same
part is general across different object categories, we establish relationships
between these parts during the prompt learning process. We conduct extensive
experiments on the PartImageNet and PascalPart datasets, and the
results demonstrate that our proposed method achieves
state-of-the-art performance.
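Sharing one part concept across categories, as described, can be sketched as prompt assembly: learnable context tokens plus a category token plus a part token that every category reuses. All names, vocabularies, and dimensions below are illustrative stand-ins, not the PartSeg implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding width

# One shared embedding per part name: the same "head" vector is reused
# for "dog head" and "cat head", encoding that the part concept is
# general across object categories.
part_emb = {p: rng.normal(size=DIM) for p in ["head", "torso", "leg"]}
cat_emb = {c: rng.normal(size=DIM) for c in ["dog", "cat"]}
ctx = rng.normal(size=(4, DIM))  # learnable context tokens (CoOp-style)

def part_prompt(category, part):
    """Stack [ctx tokens, category token, shared part token] into one
    prompt matrix that a CLIP-like text encoder would consume."""
    return np.vstack([ctx, cat_emb[category][None], part_emb[part][None]])
```

Because the part token is shared, gradients from every category update the same part vector, which is one simple way to establish the cross-category relationships the abstract mentions.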
Domain Re-Modulation for Few-Shot Generative Domain Adaptation
In this study, we delve into the task of few-shot Generative Domain
Adaptation (GDA), which involves transferring a pre-trained generator from one
domain to a new domain using only a few reference images. Inspired by the way
human brains acquire knowledge in new domains, we present an innovative
generator structure called Domain Re-Modulation (DoRM). DoRM not only meets the
criteria of high quality, large synthesis diversity, and cross-domain
consistency, which were achieved by previous research in GDA, but also
incorporates memory and domain association, akin to how human brains operate.
Specifically, DoRM freezes the source generator and introduces new mapping and
affine modules (M&A modules) to capture the attributes of the target domain
during GDA. This process resembles the formation of new synapses in human
brains. Consequently, a linearly combinable domain shift occurs in the style
space. By incorporating multiple new M&A modules, the generator gains the
capability to perform high-fidelity multi-domain and hybrid-domain generation.
Moreover, to maintain cross-domain consistency more effectively, we introduce a
similarity-based structure loss. This loss aligns the auto-correlation map of
the target image with its corresponding auto-correlation map of the source
image during training. Through extensive experiments, we demonstrate the
superior performance of our DoRM and similarity-based structure loss in
few-shot GDA, both quantitatively and qualitatively. The code will be available
at https://github.com/wuyi2020/DoRM.
Comment: Under Review
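The similarity-based structure loss can be sketched in numpy: compute the cosine self-similarity (auto-correlation) map of source-domain and target-domain features for the same latent code and penalise their difference. The L1 penalty and cosine normalisation are our assumptions for illustration; the function names are ours.

```python
import numpy as np

def autocorr_map(feat):
    """Token-wise cosine self-similarity of a (n_tokens, dim) feature map."""
    f = feat / np.linalg.norm(feat, axis=1, keepdims=True)
    return f @ f.T

def structure_loss(feat_src, feat_tgt):
    """L1 distance between the auto-correlation maps of source and target
    features; minimising it keeps the spatial structure (which regions
    resemble which) consistent across domains, without forcing the raw
    features to match.
    """
    return np.abs(autocorr_map(feat_src) - autocorr_map(feat_tgt)).mean()
```

Aligning self-similarity rather than raw features is what lets the target generator change appearance (the new domain) while preserving layout (cross-domain consistency).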
OmniForce: On Human-Centered, Large Model Empowered and Cloud-Edge Collaborative AutoML System
Automated machine learning (AutoML) seeks to build ML models with minimal
human effort. While considerable research has been conducted on AutoML in
general, aiming to take humans out of the loop when building
artificial intelligence (AI) applications, little work has examined how to make
AutoML work well in open-environment scenarios, such as training
and updating large models, industrial supply chains, or the industrial
metaverse, where people often face open-loop problems during the search
process: they must continuously collect data, update data and models, satisfy
the requirements of the development and deployment environment, support massive
devices, modify evaluation metrics, etc. Addressing the open-environment issue
with pure data-driven approaches requires considerable data, computing
resources, and effort from dedicated data engineers, making current AutoML
systems and platforms inefficient and computationally intractable.
Human-computer interaction is a practical and feasible way to tackle the
problem of open-environment AI. In this paper, we introduce OmniForce, a
human-centered AutoML (HAML) system that yields both human-assisted ML and
ML-assisted human techniques, to put an AutoML system into practice and build
adaptive AI in open-environment scenarios. Specifically, we present OmniForce
in terms of ML version management; pipeline-driven development and deployment
collaborations; a flexible search strategy framework; and widely provisioned
and crowdsourced application algorithms, including large models. Furthermore,
the (large) models constructed by OmniForce can be automatically turned into
remote services in a few minutes; this process is dubbed model as a service
(MaaS). Experimental results obtained in multiple search spaces and real-world
use cases demonstrate the efficacy and efficiency of OmniForce.
Null-text Guidance in Diffusion Models is Secretly a Cartoon-style Creator
Classifier-free guidance is an effective sampling technique in diffusion
models that has been widely adopted. The main idea is to extrapolate the model
in the direction of text guidance and away from null-text guidance. In this
paper, we demonstrate that null-text guidance in diffusion models is secretly a
cartoon-style creator, i.e., the generated images can be efficiently
transformed into cartoons by simply perturbing the null-text guidance.
Specifically, we propose two disturbance methods, i.e., Rollback disturbance
(Back-D) and Image disturbance (Image-D), to construct misalignment between the
noisy images used for predicting null-text guidance and text guidance
(subsequently referred to as \textbf{null-text noisy image} and \textbf{text
noisy image}, respectively) in the sampling process. Back-D achieves
cartoonization by altering the noise level of the null-text noisy image,
replacing $x_t$ with $x_{t+\Delta t}$. Image-D, alternatively, produces
high-fidelity, diverse cartoons by defining $x_t$ as a clean input image, which
further improves the incorporation of finer image details. Through
comprehensive experiments, we delve into the principle of noise disturbance for
null-text guidance and uncover that the efficacy of the disturbance depends on the
correlation between the null-text noisy image and the source image. Moreover,
our proposed techniques, which can generate cartoon images and cartoonize
specific ones, are training-free and easily integrated as a plug-and-play
component in any classifier-free guided diffusion model. Project page is
available at \url{https://nulltextforcartoon.github.io/}
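The misalignment trick can be sketched as a classifier-free-guidance step in which the null-text branch receives its own noisy image instead of the shared one. Here `eps_model`, its signature, and the guidance weight are hypothetical stand-ins for any diffusion noise predictor; this is our sketch, not the project's code.

```python
import numpy as np

def cfg_step(eps_model, x_t, x_null, w=7.5):
    """One classifier-free-guidance noise prediction where the null-text
    branch sees its own noisy image `x_null` rather than `x_t`.

    `eps_model(x, cond)` is any noise predictor; cond=None means null text.
    Setting x_null = x_t recovers standard CFG; feeding a rolled-back
    (noisier) or clean image constructs the Back-D / Image-D misalignment.
    """
    eps_text = eps_model(x_t, "prompt")   # text-conditioned prediction
    eps_null = eps_model(x_null, None)    # null-text prediction, perturbed input
    return eps_null + w * (eps_text - eps_null)
```

Because the change is confined to which image the null-text branch sees, it drops into any classifier-free guided sampler as a plug-and-play component, exactly as the abstract claims.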